This document serves as a reference for the work done on cotton fiber RNA data as part of a post-doc done in the lab of Jonathan F. Wendel at Iowa State University. I will try to argue that there is essentially mimicry between the homeologous co-expression networks of the At and Dt sub-genomes of domesticated polyploid cotton Gossypium hirsutum L.
The work focuses on Fiber RNA gathered over a time course of cotton fiber development, from 5 to 20 DPA in increments of 5. We have three repeats at each time period for each condition such that for the polyploid G. hirsutum we have 12 samples for domesticated and 12 samples for the wild condition.
Firstly we will cover Homeologue DGE analysis.
This analysis focuses on calling differential gene expression between homoeologs. In this instance I used R and the script HomeoTtest to assess differential expression. Essentially this uses mornalized read counts to assess differnential expression via a t-test. This takes advantage of the biological reps and the normalization of homoeologous counts (performed using HomeoNorm). The normalization takes the total library size of each rep, and then ensures the the ratio of At, Dt and N is maintained but the reps themselves are normalized.
Here is the output of that analysis:
From this data it seems that Dt has more over expressed homoeologs than At. I performed a binomial test to examine statistical deviation from equality between up-regulated in At vs Dt. The test was significant for 5, and 15 DPA, but not for 10 to 20 DPA. However, when combining all the counts for each DPA the result is also significant:
binom.test(1020,1800,p=0.5)
##
## Exact binomial test
##
## data: 1020 and 1800
## number of successes = 1020, number of trials = 1800, p-value =
## 1.687e-08
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.5434036 0.5897121
## sample estimates:
## probability of success
## 0.5666667
I have tabulated all the gene the exhibiting expression bias in homoeolog usage in each stage which can be supplied as supplementary file in the main document.
Interestingly the fact remains that there are relatively few genes where homoeologs are deferentially expressed. Out of the 16,611 genes considered in this analysis at most 951 are deferentially expressed at one stage (5 DPA in this case), the other stages have between only 5 (20 DPA) and 568 (15 DPA) homeologs that show significant boas in expression.
The next question (which you asked me to address, Jonathan) is the prevalence of cases where homoeolog expression switches over time. For example, at the start of fiber development At dominates, but is superseded by Dt at later stages. To achieve this is used the R script HomeoSwitch. In this case I take the mean expression of each homoeolog at each stage of development and calculate the mean percentage contribution of At and Dt to total expression. So in this case the analysis compares the relatvive contribution of homoeolgs to total gene expression and is actually independent of the degree to which a gene is expressed. There are a number of restrictions in the analysis which are detailed in the script, but are also listed below:
Screening for genes that fulfill this criteria revealed that 506 genes show evidence of switching, a surprisingly small number. This means that generally most genes maintain a similar balance between homoeolog usage through development. This can be seen in the following plot where the contribution of Dt is plotted against developmental stage:
It can easily be seen that the vast majority of genes have roughly equal balance between homoeologs, hence the thick black band across the middle of the plot.
Narrowing the analysis to genes that do show evidence of switching and using hierarchical clustering via euclidean distance (of homoeolog contribution) revealed several patterns of homoeolog switching during development.
I constructed co-expression networks using the WGCNA package implemented in R using expression data from At, Dt and A (sub-genomes). The processes involves building a correlation matrix of gene expression and then using hierarchical clustering to group genes based on that correlation. Details can be found in this paper. Parameters used during network analysis include:
The analysis was performed using the script HomeoNetwork.R. The resulting gene expression networks can be depicted as a cladogram with module membership indicated by the colored panel below. Firstly I assembled co-expression networks for At and Dt homeologs separately (It is important to realise the colors in each of the cladograms do not represent equivalent modules in the two different analyses).
In order to generate the networks I tried a huge number of different parameters, with the analysis shown appearing to include the most genes into co-expression networks and also grouping the genes in a more coherent way.
I thought it might be fun to take a look at the At and Dt co-expression networks and see how they look in 3D space. Using a correlation matrix, and coloring each gene according to the WGCNA module designation in Dt, I was able to create the following 2D projections of 3D networks using the R package igraph and the script construct_igraph_netwrok_homeologs.R.
At the moment you cannot see the edges between nodes (the are there but far too faint). In this plot it is important to realise that the colors DO mean the same thing, as genes are colored by their module designation in Dt in both plots. What this reveals is that largely the modules colors are co-localised in both the At and Dt networks, although the Dt network is tighter than the At counterpart (no surprise given that the genes are colored according to Dt designation). This means that genes in one module in Dt are also co-localised in the At plot indicating that co-expression patterns are some what maintained between the two. This is a ground truthing issue. We seem to be recovering the similar patterns in At and Dt networks, which is encouraging. It gets more interesting later…
Module membership is the degree to which a particular gene correlates with expression to other genes in the network. If a gene is highly correlated with many genes within the same network then we can decuce that it is a hub-like gene (it has many significant correlations with genes in the same module). In the next analysis I examine the connectivity ( i.e. module membership) for genes in At and Dt networks. Firstly you can look at the intramodule connectivity between At and Dt homeologs and compare them. Interesting questions in this regard include:
From this plot you can see that there are cases where homoelogs show completely differnt patterns of connectivity (in the top left and bottom right of the plot). These genes are highly connected in one homeologous network, but with low connectivity in another homeologous network. However, overall neither of the two homeologous networks dominates over the other (as can be seen by the approximate 1:1 relationship between intramodule connectivity). Addtionally, most of the genes cluster and the bottom left of the plot, with low conncectivity on both homeologous networks.
We can compress the multi-variate expression matrix into a single eigengene value, and then compare this value between At and Dt homeologous networks as well as between different stages of development on a per module basis. Below are examples from two modules, the lightcyan and lightgreen modules.
As you can see the eigengene value varies by developmental stage, that is, there is clear developmental signal in the expression data. Importantly however, although there is variation along the developmental axis the values for At and Dt eigengenes are very similar at any given stage. From this we can conclude that the general expression of genes in lightgreen and lightcyan modules is extremely similar in At and Dt. Provided here are only two examples of eigengene values but the trend continues for all the other modules and a complete pdf of all modules can be found here